ABSTRACT

The  goal  was  to  recognize  sustained  vowel-like  sounds  and  their 
allophones  in  one  syllable  words.   A  bank  of  filters  and  a  digital 
sampler  provided  a  data  base  for  a  polynomial  curve  fitting  routine. 
The  frequency  range  under  investigation  was  500-1000  Hz.   A  COMCOR  CI 
5000  analog  computer  and  an  XDS  9300  digital  computer  were  used. 
Although  coefficient  correlation  was  ineffective,  several  recommendations 
for  system  improvement  are  made. 
I.   INTRODUCTION

Attempts  at  speech  recognition  use  either  special  purpose  hardware 
or  computers.   In  both  cases,  filter  banks  are  often  used.   The  majority 
of  the  work  in  the  field  has  been  formant  and  frequency  analysis. 

The  goal  was  to  achieve  a  recognition  algorithm  for  sustained
vowel-like  sounds  and  their  allophones  in  one  syllable  words.   It  was  assumed
that  a  voiced  audio  signal  could  be  broken  into  eight  frequency  bands 
ranging  from  500  to  1000  Hz  and  the  respective  audio  curves  fitted  to 
polynomials.   It  was  further  assumed  that  similar  curves  have  similar 
coefficients. 

A  hybrid  system,  consisting  of  a  COMCOR  CI  5000  analog  computer  and 
a  Xerox  Data  Systems  9300  digital  computer,  is  used  to  effect  a  speech 
recognizer.   Figure  1  is  a  diagram  of  the  system. 

Two  experiments  were  conducted  prior  to  system  implementation. 
Experimentation  using  various  frequency  ranges  was  attempted  in  order  to 
resolve  a  frequency  conflict.   In  the  first  experiment  subjects  listened 
to  random  words,  whereas  in  the  second  experiment  brush  recordings  of 
the  same  words  were  studied. 

The  heart  of  the  analog  system  is  a  parallel  bank  of  eight  band-pass 
filters.   Their  output  is  smoothed,  sampled,  and  sent  to  the  XDS  9300 
for  analysis.   Figure  2  is  a  diagram  of  the  complete  analog  system. 

A  digital  program  performs  a  fifteenth  degree  polynomial  fit  on  each 
of  the  eight  audio  curves  that  are  sampled  from  the  analog  computer. 
The  program  then  outputs  eight  sets  of  normalized  coefficients  for
elementary  analysis.   System  noise  is  eliminated  digitally,  and  zero  data
points  occurring  at  the  end  of  words  are  completely  overlooked  by  the
polynomial  fitting  routine.   Several  program  modifications  were
incorporated  and  their  results  discussed.


II.   BACKGROUND  INFORMATION 

A.   TRADITIONAL  APPROACHES 

In  the  investigation  of  speech  recognition  by  the  direct  analysis 
of  a  speech  wave  (Reddy,  1966),  the  goal  was  to  produce  a  phonemic 
transcription  of  a  connected  utterance  which  is  readable  and  bears  a 
satisfactory  resemblance  to  what  was  said.   The  problem  was  confined  to 
a  single  cooperative  speaker  so  that  writing,  adjusting  and  testing 
programs  would  be  easier.   It  was  felt  that  a  "tune-in"  process  would 
adapt  the  program  to  a  wider  variety  of  speakers.   No  attempt  was 
made  to  group  the  phonemes  into  words  or  higher  level  linguistic  units. 

The  concepts  which  were  considered,  such  as  amplitude  normalization 
and  time  normalization,  show  some  insight.   In  the  case  of  sustained 
sounds  and  one  syllable  words,  though,  time  normalization  may  not  be 
necessary.   It  does  not  seem  realistic,  however,  that  the  "tune-in" 
process  could  overcome  the  lack  of  generality  in  the  original  program. 

A  procedure  for  segmenting  connected  speech  (Reddy  and  Vicens,  1968)
performs  smoothing  and  differencing  operations  on  the  digitized  acoustic 
waveform  to  generate  parameters  which  are  used  to  determine  whether  the 
characteristics  of  a  sound  are  changing  or  similar.   Parts  that  possess 
similar  parameters  are  grouped  together  to  form  sustained  segments, 
resulting  in  the  segmentation  of  connected  speech  into  parts  approximately 
corresponding  to  phonemes. 

Smoothing  looks  like  a  reasonable  operation  to  perform  on  waveforms 
before  they  are  compared.   A  question  that  arises,  though,  is  whether 
the  smoothing  should  be  done  in  the  analog  circuit  or  after  the  information 
has  been  digitized.   Perhaps,  too,  one  smoothing  operation  is  not  enough. 


A  successful  limited  speech  recognition  system  (Bobrow  and  Klatt,  1968) 
operates  within  limitations  along  a  number  of  dimensions.   Rather  than 
use  continuous  speech  in  which  segmentation  is  a  problem,  the  approach 
is  to  work  with  messages  with  easily  delimited  beginning  and  termination 
points.   The  set  of  messages  is  limited  in  number;  at  any  one  time  the 
vocabulary  to  be  distinguished  can  contain  up  to  about  100  items. 
However,  an  item  need  not  be  a  single  word,  but  may  be  any  short  phrase. 
The  system  is  useable  by  any  male  speaker,  but  must  first  be  trained  by 
him.   The  system,  LISPER,  is  not  designed  to  work  well  simultaneously 
for  a  number  of  different  speakers,  or  achieve  good  recognition  scores  for 
an  unknown  speaker.   The  training  period  consists  of  a  period  of  closed 
loop  operation  in  which  the  speaker  says  an  input  message,  the  system 
guesses  what  he  says,  and  he  responds  with  the  correct  message.   The 
recognition  algorithm  is  a  program  that  learns  to  identify  words  by 
associating  the  outputs  of  various  property  extractors  with  them.   Each 
property  has  a  corresponding  feature  state  which  may  imply  that  the  property 
is  irrelevant  for  the  current  time  interval,  the  property  is  relevant 
but  not  present,  or  the  property  is  both  relevant  and  present. 

Several  advantages  of  this  approach  are: 

1.  A  precise  segmentation  of  the  utterance  is  not  required. 

2.  The  utterance  need  not  be  a  single  word. 

3.  Features  may  be  added  to  the  system  to  provide  desirable  redundancy.

4.  The  feature  approach  permits  the  introduction  and  testing  of
linguistic  hypotheses.


Two  main  disadvantages  are:

1.  The  current  implementation  is  not  speaker  independent. 

2.  The  system  will  degrade  in  performance  as  the  length  of  the 
vocabulary  is  increased  or  as  the  number  of  speakers  that  it  can 
simultaneously  recognize  is  increased. 

The  differential  effects  upon  vowel  intelligibility  of  various  degrees 
of  time  compression  and  frequency  division  were  examined  both  with  and 
without  time  restoration  (Daniloff,  Shriner  and  Zemlin,  1968).   A  male 
speaker  and  a  female  speaker  were  used.   For  a  given  percentage  of
distortion,  frequency  division  degrades  vowel  intelligibility  more  severely
than  time  compression.   Restoring  time  to  normal  for  frequency-division 
speech  does  not  enhance  intelligibility.   Vowel  confusions  under  time 
compression  are  related  to  duration;  those  for  frequency  division 
conditions  appear  to  be  closely  related  to  the  perception  of  Vowel 
Formant  Two,  and  to  a  lesser  degree,  Vowel  Formant  One.   Patterns  of 
male  and  female  vowel  confusions  are  generally  much  alike  for  all 
conditions  and  types  of  distortion.   Results  tentatively  indicate 
superior  female  vowel  intelligibility  under  all  conditions  of  distortion, 
the  advantage  being  largest  for  frequency  division  and  somewhat  less 
for  time  compression.   These  results  suggest  that  over  a  limited  range 
of  frequency  division  up  to  forty  percent,  vowel  phonemic  quality  is 
relatively  unaffected  by  proportionate  shifting  of  fundamental  frequency 
and  formant  structure,  indicating  that  a  "relative-vowel"  hypothesis 
of  vowel  phonemic  quality  may  hold  for  limited  shifts  in  the  frequency  of 
vowel  spectra. 


The  idea  that  vowel  phonemic  quality  may  hold  during  normalization 
is  extremely  important.   However,  the  statement  that  vowel  confusions 
under  time  compression  are  related  to  duration  conflicts  with  another 
study  (Seo,  1968).   The  method  described  in  that  study  yields  time
compressed  speech  which  is  of  normal  pitch  and  highly  intelligible.   It
utilizes  a  systematic  approach  in  which  portions  of  phonemes  are
sectioned  out  without  destroying  cognitive  qualities.

Another  process  for  the  extraction  of  significant  parameters  of  speech 
involves  division  of  the  speech  spectrum  into  convenient  frequency  bands, 
and  calculation  of  amplitude  and  zero-crossing  parameters  in  each  of  these 
bands  every  ten  milliseconds  (Vicens,  1969).   In  the  software  implementation,
a  smoothing  function  divides  the  speech  spectrum  into  two  frequency 
bands  (above  and  below  1000  Hz).   In  the  hardware  implementation,  the 
spectrum  is  divided  into  three  bands  using  bandpass  filters  (150-900  Hz, 
900-2200  Hz,  and  2200-5000  Hz). 

As  in  many  other  approaches,  considerable  effort  is  spent
investigating  from  one-fourth  to  one-half  the  range  of  human  hearing.   Although
this  may  be  the  correct  approach  to  take,  the  experiments  discussed  in 
the  next  section  would  seem  to  indicate  otherwise. 

In  an  interview  at  Stanford  Research  Institute  (Walker,  1972)  it  was 
suggested  that,  rather  than  concentrate  solely  on  sustained  sound,  it 
might  be  worthwhile  to  look  at  the  dynamics  of  sounds.   It  was  further 
suggested  that  the  upper  limit  of  the  frequency  range  to  be  investigated 
be  increased  to  10  kHz.

An  earlier  conversation  with  some  of  the  technical  people  at 
Pacific  Telephone  revealed  that  a  frequency  range  of  500-1000  Hz  would 
result  in  a  highly  intelligible  sound  to  a  human  listener.   If  this  is 
the  case,  either: 
1.  The  intelligibility  is  context  dependent.

2.  A  significant  speech  parameter  is  being  overlooked  by  the  people 
who  are  investigating  the  frequencies  above  1000  Hz,  feeling  that  such 
investigation  is  necessary  to  insure  adequate  information. 

In  particular,  a  considerable  amount  of  time  is  spent  looking  for 
significant  vowel  information  between  3000  and  4000  Hz.   Section  III  will
discuss  this  conflict  in  more  detail. 

B.  GOAL 

The  initial  goal  was  to  attempt  to  program  a  hybrid  system  to
recognize  phonemes,  or  basic  sustained  sounds,  with  particular  emphasis  on  the
differences  of  similar  sounds.   The  sustained  sound,  however,  is  static 
and  therefore  unrealistic  in  nature.   The  goal  was  then  modified  so 
that  the  investigation  would  include  some  sustained  vowel-like  sounds, 
then  some  one  syllable  words  containing  those  sounds,  and  finally  an 
attempt  to  break  down  the  word  to  study  the  dynamics  of  the  vowel-like 
sound. 

C.  PRELIMINARY  ASSUMPTIONS 

The  original  premise  was  that  the  voiced  sound  could  be  broken  into 
different  frequency  ranges,  and  that  a  subroutine  could  be  used  that 
would  perform  a  polynomial  fit  to  each  of  the  filtered  audio  signals. 
The  coefficients  from  these  fits  would  then  be  used  as  a  data  base  for 
phoneme  recognition.   This  implies  that  similar  curves  will  have  similar 
coefficients.   A  comparison  of  the  coefficients  from  two  sets  of  data 
that  are  supposed  to  represent  the  same  sound  leads  to  the  theory  that  a 
unique  correlation  exists  in  some  subset  of  those  coefficients. 
Correlation  implies  that  some  subset  of  coefficients  of  a  sound  is  a
multiple,  plus  or  minus  some  error  tolerance,  of  the  same  subset  of
coefficients  of  the  same  sound  said  at  another  time.   This  subset  will
be  referred  to  from  now  on  as  the  "characteristic  subset"  of  a  sound.
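The proportionality test implied by this definition can be sketched in Python. This is only an illustration under stated assumptions, not the program used in this research: the function name `proportional_subset`, the choice of the median ratio as the common multiple, and the 5% tolerance are all introduced here for the example.

```python
def proportional_subset(coeffs_a, coeffs_b, tolerance=0.05):
    """Return the indices at which coeffs_a is a common multiple of coeffs_b.

    The common multiple is taken to be the median of the element-wise
    ratios (an assumption made for this sketch). An index is kept when
    its ratio lies within `tolerance` (relative) of that median.
    """
    ratios = {}
    for index, (a, b) in enumerate(zip(coeffs_a, coeffs_b)):
        if b != 0:                      # a zero coefficient gives no ratio
            ratios[index] = a / b
    if not ratios:
        return []
    ordered = sorted(ratios.values())
    reference = ordered[len(ordered) // 2]
    return [index for index, ratio in ratios.items()
            if abs(ratio - reference) <= tolerance * abs(reference)]


# Two hypothetical coefficient sets: the first three agree up to a factor of 2.
first_trial = [2.0, 4.0, 6.0, 1.0]
second_trial = [1.0, 2.0, 3.0, 5.0]
print(proportional_subset(first_trial, second_trial))  # [0, 1, 2]
```

Under this reading, the characteristic subset of a sound would be the set of indices that survives such a test for every pair of trials of that sound.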

D.  VOCABULARY 

Any  sound  that  is  not  a  single  vowel-like  sound  or  a  one  syllable
English  word  containing  that  vowel  sound  is  outside  the  domain  of
discussion.   A  vowel-like  sound  excludes  some  vowel  pronunciations,  such  as
"i";  it  includes  diphthongs  such  as  "ou"  in  the  word  "though".   However,  "ou"  is
excluded  in  words  such  as  "out".

E.  SYSTEM  OVERVIEW 

There  are  three  phases  to  speech  recognition: 

1.  Manipulate  and  sample  an  analog  signal. 

2.  Digitally  analyze  the  samples  obtained  from  the  analog  computer. 

3.  Apply  a  recognition  algorithm  to  the  results  of  the  digital
analysis.

In  this  research,  an  audio  signal  is  filtered  into  eight  pass  bands 
after  a  comparator  is  keyed  by  the  excitation  voltage.   The  output  from 
the  filters  is  smoothed  prior  to  the  digital  sampler.   At  the  point  of 
smoothing,  the  envelopes  of  the  filtered  signals  may  be  looked  at  on  the 
brush  recorder.   The  digitized  samples  are  passed  to  a  software  buffer  in 
the  digital  program.   After  sampling  is  complete,  program  analysis 
attempts  to  fit  the  sample  points  with  a  high  order  polynomial. 

Two  of  the  three  phases  have  been  satisfied.   The  current  state  of  the 
project  does  not  use  a  recognition  algorithm. 
III.   INITIAL  EXPERIMENTATION

There  was  a  contradiction  between  the  information  gathered  at  SRI 
about  relevant  frequency  ranges  and  that  obtained  from  Pacific  Telephone. 
Consequently,  experimentation  was  begun  by  wiring  two  Krohn-Hite  filters
in  series  to  create  a  band-pass  filter  with  a  variable  range.   After
a  microphone  input  and  an  earphone  output  were  connected,  the  upper  and
lower  bounds  of  the  pass  band  were  varied  to  determine  the
comprehensibility  of  randomly  selected  words.   Several  sets  of  twenty-five
random  words  were  chosen  to  be  read  by  three  speakers,  including  one
female  speaker.   Eight  listeners  were  selected,  given  no  background
information,  and  asked  to  put  on  the  headset,  face  away  from  the  speaker,
and  write  down  each  word  as  they  heard  it.   In  this  way,  no  visual  aids  to
speech  perception  (e.g.,  lip  movement)  were  available  to  the  listener.
Furthermore,  care  was  taken  to  ensure  that  the  listener  could  not 
hear  anything  except  what  came  through  the  headset. 

The  initial  frequency  range  used  was  500-1000  Hz,  as  this  was  the
range  of  primary  interest.   It  was  found  that  the  comprehensibility
of  the  words  that  were  selected  ranged  from  a  low  of  17  out  of  25
correct  to  a  high  of  19  out  of  25  correct,  with  most  scores
at  18  out  of  25  words.   In  100%  of  the  cases,  the  vowel
sounds  were  totally  perceptible.   Also  in  every  case,  the  words  that
were  incorrectly  transcribed  began  with  th,  d,  f,  or  s,
sounds  which  seemed  somewhat  alike  to  every  listener.   The  next  step
was  to  change  the  lower  bound  of  the  filter  to  zero  in  order  to  discover
any  further  information  that  might  be  available  at  the  lower  frequencies.
In  looking  at  the  results  of  these  tests,  it  was  determined  that  no
increase  in  information  was  gained.   The  conclusion  was  that  the  lower
bound  of  500  Hz  was  reasonable.

The  next  frequency  range  investigated  was  1000-2000  Hz,  with
somewhat  startling  results,  for  there  was  almost  a  total  loss  of  word
recognition.   This  made  the  frequency  range  of  500-1000  Hz  a  necessary 
condition  for  speech  recognition. 

As  a  check  on  the  primary  upper  limit  of  1000  Hz,  the  range  500  to
2000  Hz  was  investigated.   This  was  done  to  establish  an  upper
frequency  bound  on  the  remaining  information.   This  range  proved  to  be
sufficient,  as  one  hundred  percent  comprehension  was  obtained  from  all
listeners.   To  further  narrow  down  this  critical  range,  the  upper  limit
of  the  band  pass  was  lowered  to  1500  Hz.   It  was  found  that  the  same
level  of  understanding  was  present.   This  upper  level  was  lowered  to
1400  Hz  without  any  information  loss,  but  below  this  level  the  same
difficulties  were  encountered  as  in  the  primary  frequency  range  (i.e.,
500-1000  Hz).

The  preceding  experiment  brought  to  light  a  salient  point:   Human 
beings  possess  some  other  faculty  for  speech  understanding  besides  just 
a  complete  frequency  spectrum  analysis.   But  there  are  obviously  critical 
frequency  ranges  because  all  words  could  not  be  understood  at  frequencies 
outside  the  critical  range. 

It  should  also  be  noted  that  obtaining  center  frequencies  for  filters 
in  the  range  around  1500  Hz  is  very  unreliable  due  to  the  inaccuracy
of  the  hardware.   This  is  so  because  the  CI-5000  was  designed  to  work 
efficiently  only  at  frequencies  below  1000  Hz. 
At  this  stage  of  the  experimentation  the  brush  recorder  indicated  that  the
original  premise,  that  500-1000  Hz  is  both  a  necessary  and  a  sufficient
condition  for  speech  recognition,  was  correct.   Efforts  were  concentrated  on
looking  at  the  words  and  sounds  which  were  earlier  confused  by  the
listeners.   After  several  recordings,  the  fact  was  established  that  there
were  differences  between  the  difficult-to-discern  words  in  the  upper
frequency  ranges  (800-1000  Hz).

Based  on  the  results  of  the  experiments,  it  would  be  reasonable  to 
expect  the  primary  frequency  range  to  contain  enough  information  to 
make  speech  recognition  possible. 
IV.   ANALOG  SYSTEM

The  input  to  the  analog  system  is  a  microphone,  the  audio  output
of  which  goes  through  a  pre-amplifier  and  from  there  is  fed,  via  the
keying  circuit,  to  a  bank  of  eight  paralleled  band-pass  filters.   The
output  of  each  filter  is  connected  to  a  smoothing  circuit,  and  from
there  to  the  channels  of  the  digital  sampler,  which  in  turn  feeds  data
to  the  digital  computer  (see  figure  2).

A.   COMPARATIVE  NETWORK 

The  comparative  network  (see  figure  3,  part  A.)  acts  as  a  keying 
circuit  for  the  analog  system.   Its  function  is  to  start  the  analog 
data  gathering  when  a  person  speaks  into  the  microphone.   This  was 
necessary  in  order  to  minimize  the  timing  problem  of  speech  recognition. 

The  diagram  shows  two  inputs  to  the  comparator  (C00):  one  being
the  audio  input  signal  and  the  other  a  reference  signal.   By  adjusting
the  potentiometer  (P),  the  exciting  voltage  level  can  be  altered.   It
is  normally  set  just  above  the  noise  level  so  that  random  noise  will  not 
accidentally  key  the  circuit. 

The  output  of  the  comparator  is  normally  false  or  zero;  when  the
circuit  is  keyed,  even  for  an  instant,  delay  flip  flop  zero  (DF0)
changes  from  false  to  true  for  a  period  of  time  determined  by  a  dial
setting.   This  in  turn  puts  a  true  signal  into  T100  (TEST(1)  in  the  digital
program)  and  interrupt  52  is  enabled.

In  order  to  control  the  system  input,  a  digital  three  position  switch
(DS0)  is  employed.   As  long  as  the  switch  is  in  the  middle  or  ground
position,  it  acts  as  a  short  circuit  and  prevents  T100  from  going  true.
When  placed  in  either  of  the  two  true  positions,  it  acts  as  an  open
circuit  and  T100  can  be  enabled.   Thus,  to  key  the  system,  DS0  must  be
set  to  true  and  the  speaker  must  then  excite  the  circuit.
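The keying behavior described in this subsection can be modeled in software. The following Python sketch is only an analogy to the analog circuit: the function name `gate_samples`, its parameters, and the representation of the delay flip flop as a fixed sample count are assumptions made for illustration.

```python
def gate_samples(samples, threshold, hold_count, armed=True):
    """Return the samples passed on once the comparator has been keyed.

    samples    -- audio levels in time order (hypothetical units)
    threshold  -- reference level set by the potentiometer, just above noise
    hold_count -- number of samples for which the delay flip flop stays true
    armed      -- False models the switch in its middle (ground) position
    """
    if not armed:
        return []                      # the test line can never go true
    for start, level in enumerate(samples):
        if level > threshold:          # comparator output goes true
            return samples[start:start + hold_count]
    return []                          # the circuit was never keyed


levels = [0.0, 0.01, 0.5, 0.7, 0.2, 0.0]
print(gate_samples(levels, threshold=0.05, hold_count=3))  # [0.5, 0.7, 0.2]
```

With the reference level set just above the noise, low-level samples at the start are skipped and data gathering begins with the first sample that exceeds the reference.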

B.  BAND-PASS  FILTER 

It  was  necessary  to  build  eight  band-pass  filters  on  the  CI-5000 
analog  computer.   They  had  to  be  realizable  component-wise.   Most  textbook 
filters  were  realizable,  but  impractical  as  eight  could  not  be  made  with 
the  existing  hardware.   The  filter  chosen  was  selected  with  reluctance 
for  although  it  met  the  aforementioned  requirements,  it  was  a  low  Q 
or  low  resolution  filter. 

The  diagram  (figure  4)  shows  two  amplifiers  (A1  and  A2),  two
integrators  (I1  and  I2),  and  three  potentiometers  (P1  through  P3).   Potentiometer
one  controls  the  center  frequency  of  the  filter,  while  potentiometers 
two  and  three  control  the  band  width.   Table  one  lists  the  actual 
components  used  and  Table  two  lists  both  the  associated  potentiometer 
settings  and  the  filter  frequency  ranges. 

C.  SMOOTHING  CIRCUIT 

A  smoothing  circuit  was  incorporated  into  the  system,  again,  due  to 
hardware  limitations;  this  will  be  discussed  in  detail  in  the  Digital 
Program  Development  section  under  smoothed  data.   The  output  of  each  of 
the  filters  is  fed  into  a  separate  smoother  and  from  there  to  separate 
channels  of  the  digital  sampler.   The  function  of  the  circuit  is  to  trace 
the  envelope  of  the  audio  curve. 
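A digital analogue of envelope tracing may help to fix the idea. The Python sketch below assumes a full-wave rectifier followed by a first-order low-pass stage; the text does not specify the actual smoothing circuit, so the function name `envelope`, the `time_constant` parameter, and the numerical values are hypothetical.

```python
import math


def envelope(samples, sample_rate, time_constant):
    """Trace the envelope of a signal: rectify, then low-pass.

    sample_rate   -- samples per second of the input sequence
    time_constant -- smoothing time constant in seconds (assumed value)
    """
    # Per-sample smoothing factor of a first-order low-pass stage.
    alpha = 1.0 - math.exp(-1.0 / (sample_rate * time_constant))
    level = 0.0
    traced = []
    for value in samples:
        level += alpha * (abs(value) - level)   # abs() is the rectifier
        traced.append(level)
    return traced


# A steady 750 Hz tone of unit amplitude, finely sampled for the simulation.
tone = [math.sin(2.0 * math.pi * 750.0 * n / 8000.0) for n in range(2000)]
smoothed = envelope(tone, sample_rate=8000.0, time_constant=0.01)
```

For a steady tone the traced level settles near the mean of the rectified sine (about 0.64 of the peak), so the smoothed output varies far more slowly than the tone itself.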
D.   SAMPLING  FREQUENCY

The  sampling  frequency  is  controlled  by  two  things:  first,  the
frequency  generator  used  and  second,  the  frequency  divider  (PSET  CTR)
(see  figure  3,  part  B.).   In  order  to  attain  a  sample  frequency  of  200
samples  per  second,  a  10  kHz  frequency  generator  is  used  in  conjunction
with  a  division  by  50,  set  into  the  PSET  CTR.   This  generates  a  pulse
into  delay  flip  flop  one  (DF1)  every  five  milliseconds.   DF1,  in  turn,
enables  interrupt  52  for  0.1  millisecond,  during  which  time  a  sample  is
taken  simultaneously  by  the  eight  channels  of  the  digital  sampler  in  use.
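The timing figures in this subsection follow from simple division, which the short Python sketch below reproduces. The function name `sampler_timing` is hypothetical; the generator frequency and divisor are the values stated above.

```python
def sampler_timing(generator_hz, divisor):
    """Return (samples per second, milliseconds between samples)."""
    sample_rate_hz = generator_hz / divisor
    sample_period_ms = 1000.0 / sample_rate_hz
    return sample_rate_hz, sample_period_ms


rate, period = sampler_timing(10_000, 50)
print(rate, period)  # 200.0 5.0
```

A different divisor set into the counter would change the sample rate in inverse proportion, for example a division by 100 would give 100 samples per second.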
V.   DIGITAL  PROGRAM  DEVELOPMENT

The  output  from  the  CI  5000  is  transferred  to  the  XDS  9300  by 
means  of  a  hardware  link  between  the  two  machines.  When  an  interrupt 
occurs,  control  is  transferred  to  the  subroutine  which  handles  the  buffer 
indexing,  and  which  also  calls  the  system  subroutine  which  loads  the 
buffer.   The  digitized  samples  from  the  analog  computer  are  stored  in 
the  buffer  until  the  complete  set  of  data  has  been  gathered.   Once  this 
has  occurred  the  interrupt  is  disabled  and  the  analysis  begins. 

An  orthogonal  least-squares  curve-fitting  technique  is  applied  to 
the  data  from  each  of  the  eight  filters,  and  the  resulting  polynomial 
coefficients  are  printed.   The  coefficients  are  used  to  compute  values 
for  the  dependent  variable,  which  is  currently  plotted  by  hand  to  compare 
to  brush  recordings  of  the  same  data. 
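One common way to carry out an orthogonal least-squares polynomial fit is to build polynomials that are orthogonal over the sample points by a three-term recurrence (often attributed to Forsythe). The text does not say which formulation the program used, so the Python sketch below is an assumption offered for illustration; the function name `orthogonal_fit` and its return values are hypothetical, and the coefficients it returns multiply the orthogonal polynomials rather than powers of x.

```python
def orthogonal_fit(xs, ys, degree, weights=None):
    """Least-squares fit using polynomials orthogonal over the data points.

    Returns (coeffs, fitted): coeffs[k] multiplies the k-th orthogonal
    polynomial, and fitted holds the fitted value at each x in xs.
    """
    n = len(xs)
    w = weights if weights is not None else [1.0] * n
    p_prev = [0.0] * n          # values of p_(k-1) at the data points
    p_curr = [1.0] * n          # values of p_k, starting with p_0 = 1
    norm_prev = 1.0
    fitted = [0.0] * n
    coeffs = []
    for k in range(degree + 1):
        norm = sum(wi * p * p for wi, p in zip(w, p_curr))
        c = sum(wi * yi * p for wi, yi, p in zip(w, ys, p_curr)) / norm
        coeffs.append(c)
        fitted = [f + c * p for f, p in zip(fitted, p_curr)]
        # Recurrence: p_(k+1) = (x - a) * p_k - b * p_(k-1)
        a = sum(wi * xi * p * p for wi, xi, p in zip(w, xs, p_curr)) / norm
        b = norm / norm_prev if k > 0 else 0.0
        p_next = [(xi - a) * pc - b * pp
                  for xi, pc, pp in zip(xs, p_curr, p_prev)]
        p_prev, p_curr, norm_prev = p_curr, p_next, norm
    return coeffs, fitted


# Check on data that is exactly quadratic; the fit should reproduce it.
xs = [i / 10.0 for i in range(20)]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
coeffs, fitted = orthogonal_fit(xs, ys, degree=2)
```

Because each coefficient is computed independently of the others, raising the degree does not change the coefficients already found, which is one reason such formulations are preferred to solving the normal equations directly for high-degree fits.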

A.   INPUT  DATA  AVERAGING 

Due  to  core  limitations,  which  will  be  discussed  in  the  following 
section,  there  was  not  sufficient  space  to  store  all  of  the  samples 
taken  if  the  sampling  frequency  was  high  (i.e.,  around  1000  Hz). 
Therefore,  an  averaging  technique  was  employed.   What  actually  occurred 
was  simply  a  temporary  buffering  of  a  summation  of  several  consecutive 
points  before  their  incorporation  into  the  data  set  to  be  used  by  the
curve-fitting  routine.   From  two  to  ten  points  were  averaged  at  various 
times.   This  technique  was  later  found  to  be  unnecessary  and  too  costly 
timewise,  and  was  therefore  eliminated. 
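The averaging step can be illustrated with a few lines of Python. This is a sketch of the idea only; the function name `block_average` and the decision to drop an incomplete final group are assumptions, not details taken from the original program.

```python
def block_average(samples, group_size):
    """Replace each run of group_size consecutive samples by its mean.

    A trailing group with fewer than group_size samples is dropped
    (an assumption made for this sketch).
    """
    averaged = []
    for start in range(0, len(samples) - group_size + 1, group_size):
        group = samples[start:start + group_size]
        averaged.append(sum(group) / group_size)
    return averaged


print(block_average([1, 2, 3, 4, 5, 6, 7], 2))  # [1.5, 3.5, 5.5]
```

Averaging groups of k points reduces the storage required by a factor of k, at the cost of the extra summation on every incoming sample.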
B.   CHANGING  SAMPLING  FREQUENCY

An  initial,  but  mistaken,  assumption  was  that  samples  could  be  taken 
up  to  and  including  one  sample  every  millisecond  on  each  filter.   Thus, 
for  each  channel  one  thousand  data  points  could  theoretically  be  obtained 
over  a  period  of  one  second.   However,  due  to  limitations  of  core  storage, 
a  maximum  sample  size  of  500  data  points  per  filter  became  the  upper 
limit.   This  limit  could  have  been  extended  by  the  use  of  overlaying 
techniques  in  the  XDS  9300  memory,  but  these  techniques  were  found  to 
be  too  slow  to  effectively  take  data  at  higher  rates.   The  data  that 
were  obtained  in  using  sample  frequencies  up  to  500  samples  per  second 
had  large  discrepancies.   There  was  an  even  more  severe  limitation  in 
the  sampling  frequency  in  that  samples  could  not  be  taken  any  faster 
than  200  points  per  second;  thus,  one  sample  every  five  milliseconds. 
The  problem  that  existed  at  higher  frequencies  was  that  the  buffering 
subroutine  was  too  slow,  causing  a  stacking  of  analog  interrupts  and 
resulting  in  lost  data  points. 

Now  that  an  upper  bound  had  been  established  for  both  the  sample 
frequency  and  the  sample  size,  samples  could  be  taken  over  a  total  time 
interval  of  two  and  one  half  seconds.   However,  because  of  the  nature  of 
the  previously  defined  vocabulary,  samples  need  only  be  taken  for  one 
second  or  less,  with  the  mainstream  of  words  lasting  only  one-half  to 
three-quarters  of  a  second.   It  was  for  this  reason  that  the  sample  data 
set  normally  consisted  of  one  hundred  or  one  hundred  and  fifty  data 
points  representing  one-half  or  three-quarters  of  a  second  respectively. 
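The durations quoted in this subsection follow directly from the sample rate of 200 points per second, as the Python sketch below confirms. The function name `capture_seconds` is hypothetical.

```python
def capture_seconds(points, samples_per_second=200):
    """Time spanned by a data set of `points` samples per filter."""
    return points / samples_per_second


print(capture_seconds(500))  # 2.5  (the core-storage limit)
print(capture_seconds(100))  # 0.5
print(capture_seconds(150))  # 0.75
```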
C.   RAW  VERSUS  SMOOTHED  DATA

Early  in  the  research,  the  data  was  being  fed  directly  from  the 
hardware  filters  to  the  digital  sampler,  the  resulting  data  being  termed 
"raw  data."   In  attempting  to  look  at  the  representative  plots  on  the 
brush  recorder,  it  was  discovered  that  the  frequency  of  the  filtered 
audio  signal  was  too  high  for  the  brush  recorder's  mechanical  recording 
arm  to  follow  accurately.   In  order  to  alleviate  this  problem,  a 
smoothing  circuit  was  constructed  external  to  the  analog  computer 
(figure  5).   The  function  of  this  circuit  was  to  smooth  the  data  in 
such  a  way  as  to  present  the  envelope  of  the  original  high-frequency 
curve.   The  plotting  of  this  curve  was  within  the  mechanical  capability 
of  the  brush  recorder,  and  in  fact  led  to  the  next  step  in  data
manipulation,  for  it  was  this  smoothed  curve  that  was  of  interest.
Therefore,  instead  of  the  data  being  fed  directly  from  the  analog
filters  to  the  digital  sampler,  the  signal  was  smoothed  first  (see
figure  2).

Sampling  the  higher  frequency  curve  often  gave  misrepresentative 
data,  whereas  sampling  the  envelope  resulted  in  much  more  consistent  data. 
The  curve  obtained  by  sampling  the  raw  data  was  found  to  be  dependent 
upon  two  factors:   (1)  the  initial  point  of  sampling;  and  (2)  the 
sampling  frequency  used.   This  was  not  the  case  when  sampling  on  the 
envelope  of  the  curve,  for  it  was  immaterial  where  the  sampling  started 
or  what  the  interval  was;  the  curve  remained  almost  the  same  using 
recorded  input. 
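The dependence of raw samples on the starting point can be demonstrated numerically. In the Python sketch below a 750 Hz tone (inside the band of interest) is shaped by a slowly varying envelope and sampled at 200 samples per second from two starting instants a third of a millisecond apart. The envelope shape, the tone frequency, and all function names are hypothetical choices made for the illustration.

```python
import math


def word_envelope(t, duration=0.5):
    """Hypothetical slowly varying envelope: half a sine arch over the word."""
    return math.sin(math.pi * t / duration) if 0.0 <= t <= duration else 0.0


def raw_signal(t, tone_hz=750.0):
    """The envelope multiplied by a tone inside the 500-1000 Hz band."""
    return word_envelope(t) * math.sin(2.0 * math.pi * tone_hz * t)


def sample(signal, start_s, rate_hz=200.0, count=100):
    """Sample `signal` at a fixed rate beginning at time start_s."""
    return [signal(start_s + n / rate_hz) for n in range(count)]


def largest_difference(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


shift = 1.0 / 3000.0   # a quarter period of the 750 Hz tone
raw_change = largest_difference(sample(raw_signal, 0.0),
                                sample(raw_signal, shift))
env_change = largest_difference(sample(word_envelope, 0.0),
                                sample(word_envelope, shift))
```

In this example `raw_change` is close to the full amplitude of the signal, while `env_change` is well under one percent of it, which matches the observation that only the envelope yields consistent samples at this rate.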
D.  NOISE

As  was  just  mentioned,  the  curves  that  came  from  recorded  input  were 
almost  the  same.   It  was  this  fact  that  led  to  the  assumption  that 
random  noise  was  present  in  the  system.   The  primary  question  was  just 
how  extensively  the  noise  affected  the  input  data.   A  way  to  determine  this 
was  to  reduce  the  keying  bias  to  zero,  thereby  causing  the  analog  program 
to  take  data  without  an  exciting  voltage.   Thus,  the  only  data  taken  would 
be  noise  in  the  system. 

After  several  data  runs  of  this  type,  the  magnitude  of  the  noise 
was  found  to  be  approximately  one  one-thousandth  that  of  the  desired 
input.   It  was  therefore  decided  to  truncate  all  information  that  was 
contained  at  the  noise  level  and  retain  only  three  significant  digits 
from  the  direct  analog  input.   To  ensure  that  the  method  was  successful 
the  initial  testing  process  used  in  finding  the  noise  was  rerun.  With 
a  zero  input  to  the  system,  all  data  was  successfully  truncated  to  zero. 
Furthermore,  identical  inputs  produced  more  nearly  identical  outputs.
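The truncation step can be sketched as follows in Python. The sketch assumes that the analog input is scaled so that full scale is 1.0 and that the noise floor sits at one one-thousandth of full scale; the function name `truncate_to_noise_floor` and the `resolution` parameter are hypothetical, and the original program may have performed the truncation differently.

```python
import math


def truncate_to_noise_floor(value, resolution=0.001):
    """Discard everything below `resolution`, truncating toward zero."""
    return math.trunc(value / resolution) * resolution


noise_only = [0.0004, -0.0007, 0.0002]
print([truncate_to_noise_floor(v) for v in noise_only])  # [0.0, 0.0, 0.0]
```

A reading such as 0.4567 is reduced to approximately 0.456, so only the three leading digits of a full-scale input are retained, while inputs consisting solely of noise are reduced to zero.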

E.  NORMALIZATION 

In  attempting  to  compare  two  sets  of  coefficients,  it  was  noticed 
that  there  was  often  a  correlation  if  a  scaling  factor  was  applied 
to  one  of  the  sets  of  coefficients.   The  difference  in  the  size  of  the 
coefficients  was  possibly  due  to  the  change  in  volume  when  saying  a  word 
from  trial  to  trial.   Consequently,  the  coefficients  would  differ  from 
trial  to  trial.   Thus,  an  attempt  was  made  to  normalize  the  equations 
based  on  the  setting  of  the  high  order  coefficient  to  a  particular  constant 
thereby  causing  the  other  coefficients  to  be  scaled. 
This  technique  gave  very  promising  results  for  discrete  sets  of
trials,  but  when  the  intersection  of  the  characteristic  subsets  was 
taken,  the  resulting  subset  was  found  to  be  empty,  as  no  correlation 
could  be  obtained  for  all  data.   One  of  the  interesting  points  that  this 
particular  method  reinforced  was  the  fact  that  it  was  much  easier  to 
attempt  correlation  with  a  single  speaker  than  to  attempt  correlation 
between  different  speakers. 

It  is  important  to  note  that  the  aforementioned  normalization  is  only 
amplitude  normalization.   The  concept  of  time  normalization  has  not  been 
employed,  because  its  importance  has  been  realized  only  in  the  most 
recent  stages  of  research.   The  idea  of  time  normalization  will  be 
treated  later  in  the  paper. 
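The amplitude normalization described in this subsection amounts to dividing every coefficient by the high order coefficient and multiplying by the chosen constant. The Python sketch below shows the idea; the function name `normalize_coefficients`, the low-to-high ordering of the list, and the default constant of 1.0 are assumptions made for the example.

```python
def normalize_coefficients(coeffs, target=1.0):
    """Scale coeffs (ordered low to high degree) so the last equals target."""
    lead = coeffs[-1]
    if lead == 0:
        raise ValueError("high order coefficient is zero; cannot normalize")
    scale = target / lead
    return [c * scale for c in coeffs]


quiet_trial = [2.0, 4.0, 8.0]
loud_trial = [4.0, 8.0, 16.0]     # the same curve at twice the amplitude
print(normalize_coefficients(quiet_trial))  # [0.25, 0.5, 1.0]
print(normalize_coefficients(loud_trial))   # [0.25, 0.5, 1.0]
```

Doubling the amplitude of a curve doubles every coefficient of its fitted polynomial, so two trials that differ only in volume yield identical normalized sets. Differences in timing between trials are not removed by this step.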

F.   VARIABLE  WEIGHTING  FUNCTION 

Initially  it  was  felt  that  unweighted  data  would  suffice  in  the 
analysis  of  a  filtered  signal.   The  reasoning  was  that  if  sounds  could 
be  distinguished  visually  on  the  brush  recorder,  then  fixed  time  sampling 
using  a  ten  millisecond  time  interval  would  yield  satisfactory  results. 

Consideration  was  then  given  to  the  idea  of  equating  the  weight 
given  to  a  particular  data  point  to  the  value  of  the  data  point.   The 
intent  of  this  was  to  emphasize  the  larger  peaks  and  deemphasize  the 
smaller  peaks.   By  so  doing,  the  curve  fitting  routine  would  place 
greater  weight  on  the  peaks  when  calculating  coefficients.   This  was 
also  intended  to  give  a  zero  weight  to  data  points  with  zero  value. 

If  the  sound  being  analyzed  does  not  cover  the  full  time  interval 
that  is  being  sampled,  then  zero  data  points  appear  at  the  end  of  the 
data  set.   This  causes  the  curve  fitting  routine  to  attempt  to  fit  not 
only the non-zero data points, but the zero data points along the x-axis
as  well.   By  requiring  the  polynomial  to  fit  the  x-axis,  it  was  believed 
that  less  accurate  results  would  be  produced  than  if  the  fit  were 
restricted  just  to  the  non-zero  data  points.   The  problem  was  alleviated 
by  setting  the  weights  of  the  zero  data  points  equal  to  zero. 

Equating weights to values neglected the possibility that a small
amplitude segment of the curve might be a significant part of the
curve. Such a segment would be underweighted and underemphasized in the
curve fitting routine, while a large amplitude segment that may not be
of significance would be overweighted and overemphasized. The
coefficients would then be out of proportion to the significance of the
curve. Therefore, all weights except the zero weights were eliminated.
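
The final weighting scheme (unit weight for every non-zero sample, zero weight for zero samples) can be illustrated with a short Python sketch. It is not the curve fitting routine used in this work: it solves the weighted normal equations directly, which is adequate only for low degrees, and the names `zero_weights` and `weighted_polyfit` are hypothetical.

```python
def zero_weights(ys):
    """Weight 0 for zero-valued samples, weight 1 for all others."""
    return [0.0 if y == 0 else 1.0 for y in ys]


def weighted_polyfit(xs, ys, ws, degree):
    """Weighted least squares polynomial fit via the normal equations.

    Returns coefficients in ascending order of degree.
    """
    n = degree + 1
    a = [[sum(w * x ** (i + j) for x, w in zip(xs, ws)) for j in range(n)]
         for i in range(n)]
    b = [sum(w * y * x ** i for x, y, w in zip(xs, ys, ws)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs


# A sound that ends early: five samples of y = 2 + 3x, then trailing zeros.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [2, 5, 8, 11, 14, 0, 0, 0]
with_zero_weights = weighted_polyfit(xs, ys, zero_weights(ys), 1)
unweighted = weighted_polyfit(xs, ys, [1.0] * len(ys), 1)
```

With zero weights the fit recovers the underlying line; without them the trailing zeros drag the slope negative, which is the distortion described above.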

G.   TIME  SCALING 

The  initial  interval  between  data  points  was  arbitrarily  chosen  to 
be  one  in  the  curve  fitting  routine.   The  resulting  coefficients  were 
out  of  proportion  in  that  the  low  degree  coefficients  were  many  orders 
of  magnitude  larger  than  the  high  degree  coefficients.   In  the  comparison 
of  coefficients  of  supposedly  similar  curves,  the  high  order  coefficients 
are  far  more  important  than  the  low  order  coefficients.   Therefore, 
it  was  necessary  to  choose  a  more  appropriate  interval  that  would  decrease 
the  relative  magnitudes  of  the  coefficients. 

The  interval  size  is  inversely  proportional  to  the  number  of  data 
points  being  used.   The  use  of  200  sample  points  requires  an  interval  of 
0.1  units,  whereas  the  use  of  100  points  requires  an  interval  size  of 
0.2  units.   This  size  requirement  is  based  on  the  present  state  of  the 
program. 
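
The two examples above imply a fixed total span of 20 units (200 x 0.1 = 100 x 0.2 = 20). The Python sketch below takes that span as an assumption inferred from those figures, and also shows why the interval matters: if a curve is fitted with unit spacing and the axis is then rescaled by an interval h, the coefficient of degree k is divided by h to the power k. The function names are hypothetical.

```python
def sample_interval(n_points, span=20.0):
    """Interval between data points for a fixed total span (assumed 20 units)."""
    return span / n_points


def rescale_coeffs(coeffs, h):
    """Convert coefficients fitted with unit spacing to spacing h.

    If p(t) = sum(a_k * t**k) and x = h * t, then the same curve in x
    has coefficients a_k / h**k (ascending order of degree).
    """
    return [c / h ** k for k, c in enumerate(coeffs)]


def evaluate(coeffs, x):
    """Evaluate a polynomial given in ascending order of degree."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


unit_spacing = [5.0, 0.3, 0.002]          # low degree terms dominate
h = sample_interval(200)                  # 0.1 units
rescaled = rescale_coeffs(unit_spacing, h)
```

With an interval smaller than one, the high degree coefficients grow relative to the low degree ones, which reduces the disparity in magnitude noted above while describing the same curve.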

H.   SECOND  DEGREE  SMOOTHING

Requiring a polynomial to fit a curve with many relative maxima
and minima, many of which occur within a very short distance, causes
the coefficients to represent the envelope of the curve inaccurately.
By  eliminating  the  minimum  points,  and  keeping  only  the  maximum  points, 
a  second  degree  smoothing  was  effected.   A  copy  of  the  program  segment 
used  to  accomplish  this  can  be  found  at  the  end  of  the  computer  program 
section. 
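
The idea can be sketched in Python as follows. This is an illustration only and is not the program segment referred to above; the function name `keep_local_maxima` and the strict comparison with both neighbors are assumptions.

```python
def keep_local_maxima(samples):
    """Return (index, value) pairs for samples larger than both neighbors.

    Endpoints are never kept because they have only one neighbor.
    """
    return [
        (i, samples[i])
        for i in range(1, len(samples) - 1)
        if samples[i] > samples[i - 1] and samples[i] > samples[i + 1]
    ]


envelope_points = keep_local_maxima([0, 2, 1, 3, 1, 0])
```

The original index of each retained point is kept so that the curve fitting routine can still place it correctly on the time axis.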

This method was discarded under the current program configuration
because it eliminated not only unimportant segments of the curve but
also, under certain circumstances, salient features of the curve.

I.   DEGREE  OF  POLYNOMIAL  FIT 

In  looking  at  the  brush  recordings  of  some  of  the  words  used,  it  is 
difficult  to  determine  just  what  degree  of  polynomial  fit  is  necessary 
to  get  an  accurate  representation  of  the  curve  in  terms  of  coefficients. 
At  first,  a  twentieth  degree  fit  was  used  under  the  assumption  that 
the  larger  the  degree  of  the  polynomial  the  better  the  fit.   After 
plotting  some  of  the  resultant  curves,  it  became  obvious  that  although 
a  twentieth  degree  fit  was  appropriate  for  some  of  the  curves,  it  was 
too  great  a  degree  of  fit  for  others  because  minor  variations  in  the 
curve  were  emphasized.   A  tenth  degree  fit  was  then  tried  in  order  to 
give  a  better  average  result  for  all  of  the  curves.   This,  too,  was 
inappropriate  in  that  it  was  too  small  a  degree  of  fit.   The  present 
program  performs  a  fifteenth  degree  fit  for  all  curves. 

VI.   SUMMARY

Although the current system does not recognize speech, some
combination of the present program and the recommendations made may
lead to a speech recognizer. Two hardware limitations were encountered:
it was impossible to construct eight high resolution filters on the
CI 5000, and there was insufficient direct access core storage in the
XDS 9300. Consequently, low resolution filters and a small sample size
had to be used. One system software limitation was encountered: the
data transfer subroutine, ADL, was found to be too slow, thus
prohibiting high frequency sampling.

Based  on  the  initial  experimentation,  and  the  results  obtained  thus 
far,  it  is  possible  that  at  least  one  significant  speech  parameter  is 
being  overlooked.   Although  frequency  and  formant  analysis  may  be 
necessary,  they  are  not  sufficient  for  a  generalized  speech  recognizer. 

Each  word  and  sound  investigated  contained  a  basic  wave  shape,  but 
due  to  pronunciation  differences,  the  shape  was  altered  sufficiently  that 
coefficient correlation was not effective. Extracting the distinctive
portions of the curve that remain the same from trial to trial should
lead to a greater degree of correlation.

VII.   RECOMMENDATIONS

A.  CURVE  AVERAGING 

Instead of comparing coefficients per se, averaging the input data
points from trial to trial and studying the resulting coefficients
appears to be a promising approach to the problem of speech recognition
using the previously described system. This would entail using
overlaying techniques in the XDS-9300 system.

The  main  problem  associated  with  this  approach  is  one  of  timing; 
the  beginning  and  end  of  the  curves  must  coincide  to  be  averaged. 
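
A minimal Python sketch of the averaging step is given below, assuming the trials have already been aligned so that they start and end together. The function name `average_curves` is hypothetical, and the sketch simply rejects trials of unequal length rather than attempting to align them.

```python
def average_curves(trials):
    """Point-by-point average of several trials of the same word.

    Each trial is a list of samples; all trials must have equal length.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    length = len(trials[0])
    if any(len(t) != length for t in trials):
        raise ValueError("trials must begin and end together to be averaged")
    return [sum(column) / len(trials) for column in zip(*trials)]


mean_curve = average_curves([[1, 2, 3], [3, 4, 5]])
```

The averaged curve would then be passed to the curve fitting routine in place of any single trial.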

B.  TIME  NORMALIZATION 

The timing problem mentioned in the previous section requires
rectification regardless of what other future changes are made to the
program. A curve that is stretched over a longer distance bears little
resemblance, coefficient-wise, to the unstretched curve. For this
reason, any future polynomial curve fitting approach must take into
account the problem of sound duration.
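
One possible form of time normalization is to resample every curve to the same number of points before fitting. The Python sketch below uses linear interpolation; the technique and the function name `resample` are assumptions chosen for illustration and were not part of the present program.

```python
def resample(curve, n_out):
    """Stretch or compress a curve to n_out points by linear interpolation."""
    if n_out < 2 or len(curve) < 2:
        raise ValueError("need at least two input and two output points")
    scale = (len(curve) - 1) / (n_out - 1)
    out = []
    for k in range(n_out):
        pos = k * scale
        i = min(int(pos), len(curve) - 2)   # left neighbor index
        frac = pos - i                      # distance past the left neighbor
        out.append(curve[i] * (1 - frac) + curve[i + 1] * frac)
    return out


short_trial = [0, 2, 4, 2, 0]
long_trial = resample(short_trial, 9)       # same sound, spoken more slowly
normalized = resample(long_trial, 5)        # brought back to a common length
```

After resampling, the slow and fast trials occupy the same interval, so their coefficients can be compared on an equal footing.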

C.  SEGMENTED  CURVE  FITTING 

Throughout  the  experimentation,  it  was  noticed  that  although  one 
particular  curve  did  not  totally  match  another,  there  were  large 
segments  of  the  curves  that  matched  quite  well,  especially  in  the  latter 
segments.   Thus,  instead  of  one  set  of  coefficients  to  represent  an 
audio  curve,  there  might  be  several  representing  various  curve  segments. 
Again,  time  normalization  must  be  considered. 
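
The segmented approach can be sketched in Python as follows. For brevity each segment is fitted with a first degree polynomial in closed form, which is a simplification; higher degrees would be used in practice. The function names and the equal-length segmentation are assumptions made for the example.

```python
def fit_line(ys):
    """Least squares line through ys at x = 0, 1, 2, ...; returns (intercept, slope)."""
    n = len(ys)
    if n < 2:
        raise ValueError("a segment needs at least two points")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return (y_mean - slope * x_mean, slope)


def segmented_fit(curve, n_segments):
    """Fit each of n_segments consecutive pieces of the curve separately."""
    size = len(curve) // n_segments
    if size < 2:
        raise ValueError("too many segments for this curve")
    fits = []
    for s in range(n_segments):
        start = s * size
        # The last segment absorbs any leftover points.
        end = len(curve) if s == n_segments - 1 else start + size
        fits.append(fit_line(curve[start:end]))
    return fits


segment_coeffs = segmented_fit([0, 1, 2, 3, 3, 2, 1, 0], 2)
```

Each segment yields its own set of coefficients, so two curves can be compared segment by segment, for example over the latter segments only.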

D.  HARDWARE  CHANGES

The bandpass filters used were relatively low resolution due to
hardware limitations imposed by the CI-5000. In order to have better
filters, it would be necessary to construct them from component parts.
There  is  strong  evidence  that  this  would  help  to  eliminate  the  harmonics 
of  voiced  audio  signals,  which  cause  random  variance  at  different 
frequency  ranges  dependent  upon  the  speaker. 

E.  SECOND  DEGREE  ANALOG  SMOOTHING 

Although  digital  second  degree  smoothing  was  found  to  be  of  no 
practical  value,  this  does  not  mean  that  a  second  degree  analog  smoothing 
circuit  would  react  in  the  same  manner.   Implementing  this  feature 
could  help  to  alleviate  minor  differences  in  audio  curves.   Thus,  a 
closer  coefficient  correlation  could  be  effected. 

F.  INPUT  DATA  CORRELATION 

To  this  point,  all  recommendations  have  concerned  themselves  in 
some  manner  with  coefficient  correlation.   Given  a  time  normalized 
curve,  it  might  be  interesting  to  attempt  data  point  correlation  of 
some  form.   As  was  pointed  out  in  the  section  recommending  segmented 
curve  fitting,  there  were  often  parts  of  the  audio  curves  that  compared 
quite  favorably.   By  looking  only  at  the  associated  data  points,  an 
interesting  type  of  correlation  might  be  accomplished. 
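
One form such a data point correlation might take is the product-moment (Pearson) correlation coefficient between two time normalized curves, or between matching segments of them. The Python sketch below is an assumption about how this could be done, not a method that was tried in this work; the function name `pearson` is hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length curves."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("curves must have the same length (at least 2)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a constant curve has no defined correlation")
    return cov / math.sqrt(var_a * var_b)


same_shape = pearson([1, 2, 3, 2], [2, 4, 6, 4])   # louder copy of the same curve
mirrored = pearson([1, 2, 3], [3, 2, 1])
```

Because the coefficient is unaffected by a uniform change in amplitude, a louder repetition of the same curve still correlates perfectly, while time misalignment would lower the value.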

G.  ORTHOGONAL  COEFFICIENT  CORRELATION 

The current program outputs coefficients of the form $B_i$, as described
in section II. However, each $B_i$ is dependent upon all of the orthogonal
coefficients, $C_j$. The equation is of the form:

$$B_i = K \sum_j C_j \, \phi_j(x)$$

where the $\phi_j(x)$ are orthogonal polynomials. It is obvious from this
that a change in only one $C_j$ will affect every $B_i$. Therefore, results
could perhaps be attained by investigating the orthogonal coefficients,

FIGURE 5

[Figure not reproduced; one surviving diagram label reads A3. Component assignments for each filter:]

            I1     I2     A1     A2     P1     P2     P3
FILTER 1    A001   A063   A000   A002   P001   P000   P002
FILTER 2    A005   A007   A010   A006   P005   P004   P006
FILTER 3    A011   A013   A014   A016   P011   P012   P013
FILTER 4    A015   A017   A022   A024   P015   P016   P021
FILTER 5    A031   A033   A026   A030   P031   P027   P026
FILTER 6    A041   A037   A034   A036   P037   P036   P034
FILTER 7    A045   A047   A042   A044   P045   P044   P042
FILTER 8    A053   A055   A050   A052   P053   P052   P050

POTENTIOMETER SETTINGS

P000    .1885
P001    .1006
P002    .1885
P004    .1885
P005    .1264
P006    .1885
P010    .1641
P012    .1885
P013    .1885
P015    .2015
P016    .1885
P021    .1885
P026    .1885
P027    .1885
P031    .242?
P034    .1885
P036    .1885
P037    .2883
P042    .1885
P044    .1885
P045    .3376
P050    .1885
P052    .1885
P053    .3?04
P406    .020    (biasing pot)

(? = illegible digit)

            LOWER         CENTER    UPPER
            3 dB LEVEL    FREQ.     3 dB LEVEL

FILTER 1    490           505       520
FILTER 2    560           575       590
FILTER 3    630           645       660
FILTER 4    700           715       730
FILTER 5    770           785       800
FILTER 6    840           855       870
FILTER 7    910           925       940
FILTER 8    980           995       1010

All frequencies in hertz (Hz).

TABLE